Abstract

In this work, we propose a novel, implicitly-defined neural network
architecture and describe a method to compute its components. The
proposed architecture forgoes the causality assumption previously used
to formulate recurrent neural networks and allows the hidden states of
the network to be coupled together, which may improve performance on
problems with complex, long-distance dependencies. Initial experiments
demonstrate that the new architecture outperforms both the Stanford
Parser and a baseline bidirectional network on the Penn Treebank
Part-of-Speech tagging task, and a baseline bidirectional network on an
additional artificial random biased walk task.

1  Introduction 

Feedforward neural networks were designed to approximate and
interpolate functions. Recurrent Neural Networks (RNNs) were developed
to predict sequences. RNNs can be ‘unwrapped’ and thought of as very
deep feedforward networks, with each layer sharing the same set of
weights. Computation proceeds one step at a time, like the trajectory
of an ordinary differential equation when solving an initial value
problem. The path of an initial value problem depends only on the
current state and the current value of the forcing function. In an RNN,
the analogues are the current hidden state and the current element of
the input sequence. However, in certain applications in natural
language processing, especially those with long-distance dependencies
or where grammar matters, sequence prediction may be better thought of
as a boundary value problem. Changing the value of the forcing function
(analogously, of an input sequence element) at any point in the
sequence will affect the values everywhere else. The bidirectional
network addresses this problem by allowing information to flow in both
directions. However, each part can only consider information from one
direction. In practice, many algorithms require more than two passes
through the data to determine an answer. We provide a different
mechanism than the bidirectional network, motivated by a program that
iterates over itself until convergence.

1.1  Related  Work 

Bidirectional,  long-distance  dependencies  in  sequences 
have  been  an  issue  as  long  as  there  have  been  NLP 
tasks,  and  there  are  many  approaches  to  dealing  with 
them. 

Hidden Markov models (HMMs) (Rabiner, 1989) have been used extensively
for sequence-based tasks, but they rely on the Markov assumption: that
a hidden variable changes its state based only on its current state and
observables. In finding maximum likelihood state sequences, the
Forward-Backward algorithm can take into account the entire set of
observables, but the underlying model is still local.

In recent years, the popularity of the Long Short-Term Memory (LSTM)
(Hochreiter and Schmidhuber, 1997) and variants such as the Gated
Recurrent Unit (GRU) (Cho et al., 2014) has soared, as they enable RNNs
to process long sequences without the problem of vanishing or exploding
gradients (Pascanu et al., 2013). However, these models allow
information and gradients to flow only in the forward direction.

The Bidirectional LSTM (b-LSTM) (Graves and Schmidhuber, 2005), a
natural extension of (Schuster and Paliwal, 1997), incorporates past
and future hidden states via two separate recurrent networks, allowing
information and gradients to flow in both directions of a sequence.
This is a very loose coupling, however.

In contrast to these methods, our work goes a step further, fully
coupling the entire sequence of hidden states of an RNN. Our work is
similar to (Finkel et al., 2005), which augments a CRF with
long-distance constraints. However, our work differs in that we extend
an RNN and use Newton-Krylov (Knoll and Keyes, 2004) instead of Gibbs
sampling.


2  The  Implicit  RNN  (iRNN) 

2.1  Traditional  Recurrent  Neural  Networks 

A typical recurrent neural network has a (possibly transformed) input
sequence $[\xi_1, \xi_2, \ldots, \xi_n]$ and initial state $h_s$ and
iteratively produces future states:

$$
\begin{aligned}
h_1 &= f(\xi_1, h_s) \\
h_2 &= f(\xi_2, h_1) \\
&\;\;\vdots \\
h_n &= f(\xi_n, h_{n-1})
\end{aligned}
$$

The  LSTM,  GRU,  and  related  variants  follow  this 
formula,  with  different  choices  for  the  state  transition 
function.  Computation  proceeds  linearly,  with  each 
next  state  depending  only  on  inputs  and  previously 
computed  hidden  states. 
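
As a minimal sketch of this sequential recurrence, the following uses a toy scalar transition function. The function `f`, its weights `w` and `u`, and the input values are illustrative assumptions, not part of the model described here; the point is only that each state is computed from the current input and the previous state.

```python
import math

def f(xi, h_prev, w=0.5, u=0.8):
    """Toy scalar transition (hypothetical stand-in for an LSTM/GRU cell)."""
    return math.tanh(w * xi + u * h_prev)

def run_rnn(xis, h_start=0.0):
    """Apply h_t = f(xi_t, h_{t-1}) left to right and return all states."""
    states = []
    h = h_start
    for xi in xis:
        h = f(xi, h)          # depends only on the current input and the past
        states.append(h)
    return states

states = run_rnn([1.0, -0.5, 0.25])
# Changing a later input cannot alter earlier states in this causal setup.
altered = run_rnn([1.0, -0.5, 9.0])
```

Because the computation is causal, `states[:2]` and `altered[:2]` coincide; this is the assumption that the proposed architecture relaxes.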


Figure  1:  Traditional  RNN  structure. 


2.2  Proposed  Architecture 

In this work, we relax this assumption by allowing
$h_t = f(\xi_t, h_{t-1}, h_{t+1})$.¹ This leads to an implicit set of
equations for the entire sequence of hidden states, which can be
thought of as a single tensor $H$:

$$H = [h_1, h_2, \ldots, h_n]$$


This yields a system of nonlinear equations. This setup has the
potential to arrive at nonlocal, whole-sequence-dependent results. We
also hope such a system is more ‘stable’, in the sense that the
predicted sequence may drift less from the true meaning, since errors
will not compound with each time step in the same way.

There are many potential ways to architect a neural network (in fact,
this flexibility is one of deep learning’s best features), but we
restrict our discussion to the structure depicted in Figure 2. In this
setup, we have the following variables:


    data          $X$
    labels        $Y$
    parameters    $\theta$

and functions:

    implicit hidden layer definition    $H = F(\theta, \xi, H)$
    loss function                       $L = \ell(\theta, H, Y)$
    input layer transformation          $\xi = g(\theta, X)$


Our implicit definition function, $F$, is made up of local state
transitions and forms a system of nonlinear equations that must be
solved, letting $h_s$ and $h_e$ be boundary states:

$$
\begin{aligned}
h_1 &= f(h_s, h_2, \xi_1) \\
h_i &= f(h_{i-1}, h_{i+1}, \xi_i) \\
h_n &= f(h_{n-1}, h_e, \xi_n)
\end{aligned}
$$

¹ A wider stencil can also be used, e.g. $f(h_{t-2}, h_{t-1}, \ldots)$.

Figure 2: Proposed iRNN architecture.

2.3  Computing  the  forward  pass 

To evaluate the network, we must solve the equation $H = F(H)$. We
computed this via an approximate Newton solve:

$$H_{n+1} = H_n - (I - \nabla_H F)^{-1} (H_n - F(H_n))$$

Since $(I - \nabla_H F)$ is sparse, we apply Krylov subspace methods
(Knoll and Keyes, 2004), specifically the BiCG-Stab method (Van der
Vorst, 1992), since the system is non-symmetric. This has the added
advantage of relying only on matrix-vector multiplies of the gradient
of $F$.
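
The Newton-Krylov forward pass can be sketched on a toy problem as follows. This is not the implementation used in this work: the scalar transition `tanh(W_PREV * h_prev + W_NEXT * h_next + xi)`, the coupling weights, the zero boundary states, and the hand-written BiCG-Stab routine are all assumptions chosen so that the example is self-contained. The linear solver touches $(I - \nabla_H F)$ only through matrix-vector products.

```python
import math

W_PREV, W_NEXT = 0.3, 0.3   # hypothetical coupling weights (assumption)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neighbours(vec, i):
    """Left and right neighbours of entry i, with zero boundary values."""
    left = vec[i - 1] if i > 0 else 0.0
    right = vec[i + 1] if i < len(vec) - 1 else 0.0
    return left, right

def F(H, xi):
    """Toy implicit layer: each state depends on both of its neighbours."""
    out = []
    for i in range(len(H)):
        left, right = neighbours(H, i)
        out.append(math.tanh(W_PREV * left + W_NEXT * right + xi[i]))
    return out

def jac_vec(H, xi, v):
    """Matrix-free product (grad_H F) v for the toy layer above."""
    FH = F(H, xi)
    out = []
    for i in range(len(H)):
        left, right = neighbours(v, i)
        out.append((1.0 - FH[i] ** 2) * (W_PREV * left + W_NEXT * right))
    return out

def bicgstab(matvec, b, tol=1e-12, max_iter=200):
    """Unpreconditioned BiCG-Stab for A x = b, starting from x = 0."""
    n = len(b)
    x, r = [0.0] * n, list(b)
    r_hat = list(r)
    rho = alpha = omega = 1.0
    v, p = [0.0] * n, [0.0] * n
    for _ in range(max_iter):
        if math.sqrt(dot(r, r)) < tol:
            break
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = matvec(p)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        t = matvec(s)
        tt = dot(t, t)
        omega = dot(t, s) / tt if tt > 0.0 else 0.0
        x = [xi_ + alpha * pi + omega * si for xi_, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        rho = rho_new
    return x

def newton_krylov(xi, tol=1e-9, max_newton=50):
    """Solve H = F(H) by Newton steps with a Krylov inner solve."""
    H = [0.0] * len(xi)
    for _ in range(max_newton):
        residual = [h - f for h, f in zip(H, F(H, xi))]
        if math.sqrt(dot(residual, residual)) < tol:
            break
        A = lambda v, H=H: [vi - jv for vi, jv in zip(v, jac_vec(H, xi, v))]
        step = bicgstab(A, residual)      # (I - grad_H F) step = H - F(H)
        H = [h - s for h, s in zip(H, step)]
    return H

XI = [0.5, -1.0, 0.3, 0.8, -0.2, 0.1]
H_star = newton_krylov(XI)
```

In this sketch the Jacobian-vector product is written out analytically; in an automatic differentiation framework it would typically come from a forward-mode operator instead.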

We also considered approximating the inverse via a polynomial
$P_n(\nabla_H F)$, using the geometric series
$P_n(x) = 1 + x + x^2 + \ldots + x^n$, which converges provided that
$\|\nabla_H F\| < 1$. With $n = 1$, this proved a reasonable
approximation for Part-of-Speech tagging (Section 3.2). For other
tasks, however, we found the eigenvalues were not as well-behaved with
this approximation scheme.
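
To illustrate the geometric-series approximation, the sketch below applies $P_n(A)$ to a vector $b$ for a small matrix $A$ standing in for $\nabla_H F$. The matrix and vector are made-up values chosen so that the exact answer $(I - A)^{-1} b$ is easy to verify by hand; they are not taken from the experiments.

```python
def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def neumann_solve(A, b, n):
    """Approximate (I - A)^{-1} b by P_n(A) b = b + A b + ... + A^n b."""
    total = list(b)
    term = list(b)
    for _ in range(n):
        term = matvec(A, term)                        # next power of A applied to b
        total = [t + s for t, s in zip(total, term)]
    return total

A = [[0.2, 0.1],
     [0.0, 0.3]]          # hypothetical stand-in for grad_H F, with norm < 1
b = [1.0, 1.0]

x_n1 = neumann_solve(A, b, 1)     # the n = 1 approximation, (I + A) b
x_n30 = neumann_solve(A, b, 30)   # close to the exact solution
```

Here $b$ happens to be an eigenvector of $A$ with eigenvalue 0.3, so the exact solution is $b / 0.7$ in each component. The $n = 1$ estimate gives 1.3 per component against an exact value of about 1.4286, and the gap shrinks geometrically as $n$ grows. When the norm condition fails, the series diverges instead.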

2.4  Gradients 

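In order to train the model, we perform gradient descent. We take the
gradient of the loss function:

$$\nabla_\theta L = \nabla_\theta \ell + \nabla_H \ell \, \nabla_\theta H$$

The gradient of the hidden units with respect to the parameters can be
found via the implicit definition:

$$
\begin{aligned}
\nabla_\theta H &= \nabla_\theta F + \nabla_H F \, \nabla_\theta H + \nabla_\xi F \, \nabla_\theta \xi \\
&= (I - \nabla_H F)^{-1} (\nabla_\theta F + \nabla_\xi F \, \nabla_\theta \xi)
\end{aligned}
$$

where the factorization follows from noting that

$$(I - \nabla_H F) \nabla_\theta H = \nabla_\theta F + \nabla_\xi F \, \nabla_\theta \xi.$$

The entire gradient is thus:

$$\nabla_\theta L = \nabla_H \ell \, (I - \nabla_H F)^{-1} (\nabla_\theta F + \nabla_\xi F \, \nabla_\theta \xi) + \nabla_\theta \ell \qquad (1)$$

Once again, the inverse of $I - \nabla_H F$ appears, and we can compute
it via Krylov subspace methods.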
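
The implicit gradient can be checked on a one-dimensional example. The sketch below assumes a scalar implicit layer $h = \tanh(\theta h + x)$ and a squared-error loss, both chosen purely for illustration. In this toy case the loss has no direct dependence on $\theta$ and the input does not depend on $\theta$, so the $\nabla_\theta \ell$ and $\nabla_\xi F \, \nabla_\theta \xi$ terms vanish and only $\nabla_H \ell \, (I - \nabla_H F)^{-1} \nabla_\theta F$ remains.

```python
import math

def solve_h(theta, x, iters=200):
    """Solve h = tanh(theta * h + x) by fixed-point iteration (|theta| < 1)."""
    h = 0.0
    for _ in range(iters):
        h = math.tanh(theta * h + x)
    return h

def loss(theta, x, y):
    """Squared error between the solved hidden state and a target y."""
    h = solve_h(theta, x)
    return 0.5 * (h - y) ** 2

def implicit_grad(theta, x, y):
    """dL/dtheta from the implicit-function formula, without unrolling."""
    h = solve_h(theta, x)
    sech2 = 1.0 - h * h            # tanh'(theta*h + x), using F(h) = h at the solution
    dF_dh = theta * sech2          # scalar version of grad_H F
    dF_dtheta = h * sech2          # scalar version of grad_theta F
    dh_dtheta = dF_dtheta / (1.0 - dF_dh)
    return (h - y) * dh_dtheta     # grad_H loss times grad_theta H

theta, x, y = 0.5, 0.3, 1.0
g_implicit = implicit_grad(theta, x, y)

# Central finite difference as an independent check.
eps = 1e-6
g_fd = (loss(theta + eps, x, y) - loss(theta - eps, x, y)) / (2 * eps)
```

The two estimates `g_implicit` and `g_fd` agree to within finite-difference error, which is the scalar counterpart of Equation 1.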

2.5  Transition  Functions 

Recall the original GRU equations (Cho et al., 2014), with slight
notational modifications:

    final hidden        $h_t = (1 - z_t) \hat{h}_t + z_t \tilde{h}_t$
    candidate hidden    $\tilde{h}_t = \tanh(W x_t + U (r_t \hat{h}_t) + b)$
    update weight       $z_t = \sigma(W_z x_t + U_z \hat{h}_t + b_z)$
    reset gate          $r_t = \sigma(W_r x_t + U_r \hat{h}_t + b_r)$

We make the following substitution for $\hat{h}_t$ (which was set to
$h_{t-1}$ in the original GRU definition):

    state comb.     $\hat{h}_t = s h_{t-1} + (1 - s) h_{t+1}$
    switch          $s = \dfrac{s_p}{s_p + s_n}$                      (2)
    prev. switch    $s_p = \sigma(W_p x_t + U_p h_{t-1} + b_p)$
    next switch     $s_n = \sigma(W_n x_t + U_n h_{t+1} + b_n)$


This modification makes the architecture both implicit and
bidirectional, since the combined state $\hat{h}_t$ is a linear
combination of previous and future hidden states. The switch variable
$s$ is determined by a competition between two sigmoidal units $s_p$
and $s_n$, representing the contributions of the previous and next
hidden states, respectively.
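
The switch mechanism can be written out in a few lines. The sketch below uses scalar states and scalar weights for readability; the function name `combine` and all numeric values are hypothetical, and in the actual model the weights are matrices and the operations act elementwise on vectors.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def combine(x_t, h_prev, h_next, w_p, u_p, b_p, w_n, u_n, b_n):
    """Mix previous and next hidden states using the normalized switch s."""
    s_p = sigmoid(w_p * x_t + u_p * h_prev + b_p)   # vote for the previous state
    s_n = sigmoid(w_n * x_t + u_n * h_next + b_n)   # vote for the next state
    s = s_p / (s_p + s_n)                           # lies strictly between 0 and 1
    h_hat = s * h_prev + (1.0 - s) * h_next         # convex combination
    return h_hat, s

# Identical parameters and identical neighbours give an even split.
_, s_even = combine(0.2, 0.4, 0.4, 1.0, 0.5, 0.0, 1.0, 0.5, 0.0)

# A large bias on the "previous" unit pushes s above one half.
h_hat, s_prev = combine(0.2, 0.7, -0.1, 1.0, 0.5, 3.0, 1.0, 0.5, -3.0)
```

Because `s` is a ratio of two positive sigmoid outputs, `h_hat` always lies between `h_prev` and `h_next`, so neither direction can be amplified beyond its original magnitude by the switch alone.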

2.6  Implementation  Details 

We implemented the iRNN structure using Theano (Bergstra et al., 2011).
The product $\nabla_H F \, v$ for various $v$, required for the
BiCG-Stab method, was computed via the Rop operator. In computing
$\nabla_\theta L$ (Equation 1), we noted it is more efficient to
compute $\nabla_H \ell \, (I - \nabla_H F)^{-1}$ first, and thus used
the Lop operator.

Batching the nonlinear solver was slightly tricky: it was
straightforward to perform the same BiCG-Stab computations across
different elements in the batch, but some elements converged
significantly faster than others. For this reason, we found it helpful
to run BiCG from two separate initializations for each element, one
selected randomly and the other set to the forward approximation of the
GRU (omitting Equation 2). If either of the two candidates converged,
we took its value and stopped computing the other.


3  Experiments 

3.1  Biased  random  walks 

We developed an artificial task with bidirectional sequence-level
dependencies to explore the performance of our model. Our task was to
find the point at which a random walk, in the spirit of the Wiener
process (Durrett, 2010), changes from a zero to a nonzero mean. We
trained a network to predict when the walk is no longer unbiased. We
generated algorithmic data for this problem, the specifics of which are
as follows. First, we chose an integer interval length $N$ uniformly in
the range 1 to 40. Then, we chose a (continuous) time $t' \in [0, N)$
and a direction $v \in \mathbb{R}^d$. We produced the input sequence
$x_i \in \mathbb{R}^d$, setting the first element to $0$ and
iteratively computing $x_{i+1} = x_i + \mathcal{N}(0, 1)$. After time
$t'$, a bias term of $b \cdot v$ was added at each time step
($b \cdot v \cdot (\lceil t' \rceil - t')$ for the first time step
greater than $t'$), where $b$ is a global scalar parameter. The network
was fed these elements and asked to predict $y = 0$ for times $t < t'$
and $y = 1$ for times $t > t'$.
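
One possible reading of this data-generation procedure is sketched below. Several details are assumptions rather than facts from the text: the direction $v$ is drawn as a random unit vector, sequence element $i$ is identified with time $i$ starting from zero, and the function name `make_walk` and its arguments are hypothetical.

```python
import math
import random

def make_walk(b, d=2, max_len=40, rng=random):
    """Generate one biased-walk example: (inputs, labels, change time)."""
    n = rng.randint(1, max_len)                 # integer length N in 1..40
    t_change = rng.random() * n                 # continuous t' in [0, N)

    # Assumption: the direction v is a random unit vector in R^d.
    raw = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in raw)) or 1.0
    v = [c / norm for c in raw]

    xs = [[0.0] * d]                            # the walk starts at the origin
    for i in range(1, n):
        if i <= t_change:
            frac = 0.0                          # still unbiased
        elif i - 1 < t_change:
            frac = i - t_change                 # partial first step after t'
        else:
            frac = 1.0                          # full bias b * v afterwards
        step = [rng.gauss(0.0, 1.0) + b * frac * vc for vc in v]
        xs.append([p + s for p, s in zip(xs[-1], step)])

    ys = [1 if i > t_change else 0 for i in range(n)]
    return xs, ys, t_change

example_rng = random.Random(0)
xs, ys, t_change = make_walk(b=1.0, d=3, rng=example_rng)
```

For non-integer $t'$, the partial-step factor `i - t_change` equals $\lceil t' \rceil - t'$, matching the description above. Smaller values of `b` make the change point harder to detect from any single step, so a model has to pool evidence from both sides of it.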

For each architecture, $\xi$ was simply the unmodified input vectors,
zero-padded to the embedding dimension size. The output was a simple
binary logistic regression. We produced 50,000 random training
examples, 2500 random validation examples, and 5000 random test
examples. The implicit algorithm used a hidden dimension of 200, and
the b-LSTM had an embedding dimension ranging from 100 to 1000. A
b-LSTM dimension of 300 was the point at which the total numbers of
parameters were roughly equal.

The results are shown in Table 1. The b-LSTM scores reported are the
maximum over sweeps from 100 to 1500 hidden dimension size. The iRNN
outperforms the best b-LSTM in the more challenging cases where the
bias size $b$ is small.


    b       iRNN Error    b-LSTM Error
    2.0     0.0226        0.0210
    1.0     0.0518        0.0589
    0.75    0.0782        0.0879
    0.5     0.119         0.132
    0.25    0.189         0.205

Table  1:  Biased  walk  classification  performance. 


3.2  Part-of-speech  tagging 

We next applied our model to a real-world problem. Part-of-speech
tagging fits naturally in the sequence labeling framework, and has the
advantage of a standard dataset that we can use to compare our network
with other techniques. To train a part-of-speech tagger, we simply let
$L$ be a softmax layer transforming each hidden unit output into a
part-of-speech tag. Initially, $\xi$ consisted only of word vectors for
39,000 case-sensitive vocabulary words. Next, we lowercased the
vocabulary words, but added a single feature indicating whether case
appeared in the data. Third, we added six additional ‘word vector’
components to encode the top-2000 most common prefixes and suffixes of
words, for affix lengths 2 to 4. Finally, we added other (binary)
features to indicate the presence of numbers, symbols, punctuation, and
richer case data, as used by (Huang et al., 2015).

We trained the Part of Speech (POS) tagger on the Penn Treebank Wall
Street Journal corpus (Marcus et al., 1993), blocks 0-18, validated on
19-21, and tested on 22-24, per convention. Training was done using
stochastic gradient descent, with an initial learning rate of 0.5 and a
batch size of 20. Word vectors were of dimension 200, and prefix and
suffix vectors were of dimension 12. Hidden unit size was equal to
feature input size, so in this case, 280. Training took about 5 seconds
per batch on a Tesla K40 GPU.

As shown in Table 2, the iRNN outperformed the baseline GRU, LSTM, and
b-LSTM, all with 500-dimensional hidden layers, as well as the Stanford
Part-of-Speech tagger (Toutanova et al., 2003) (model
wsj-0-18-bidirectional-distsim.tagger from 10-31-2016). Note that
performance gains past approximately 97% are difficult due to
errors/inconsistencies in the dataset, ambiguity, and complex
linguistic constructions including dependencies across sentence
boundaries (Manning, 2011).


    Architecture           WSJ Accuracy (%)
    GRU                    96.43
    LSTM                   96.47
    Bidirectional GRU      97.28
    b-LSTM                 97.25
    iRNN                   97.37
    Stanford POS Tagger    97.33

Table 2: Tagging performance relative to other recurrent architectures.

3.2.1  Visualizations 

We visualized some of the outputs of the “switch” variables for various
sentences. The switch is made up of many features, and it does not
necessarily always correspond to human judgment, but by taking the
average, one can get a sense of the flow of information. In Figure 3,
we see a visualization of the switch on a very simple sentence, and in
Figure 4 we see it in action over a more complicated sentence.
Interestingly, phrasal structures emerge.

4  Conclusion  and  Future  Work 

We have introduced a novel, implicitly defined neural network
architecture based on the GRU and shown that it outperforms a b-LSTM on
an artificial random walk task and slightly outperforms both the
Stanford Parser and a baseline bidirectional network on the Penn
Treebank Part-of-Speech tagging task.

In future work, we intend to consider implicit variations of other
architectures, such as the LSTM, as well as additional, more
challenging, and/or data-rich applications.

Figure 3: Visualization of the switch variable, with True/Predicted POS
tags. Positive values indicate a right-to-left flow of information,
while negative values indicate left-to-right. Note how ‘Tokyo’ modifies
‘market’, and information flows between them.

Figure 4: A more complicated example, more representative of the WSJ
corpus.